Abstract: Traditional data mining algorithms hit severe bottlenecks when they deal with large data sets. This paper describes a technique for clustering large, high-dimensional data sets. The main idea is to use an inexpensive, approximate distance measure to efficiently partition the data into overlapping subsets, called canopies. Once the canopies are built, the desired clustering is performed by measuring exact distances only between points that occur in a common canopy. Using canopies, large clustering problems that were formerly impractical become feasible and efficient. K-Means is a typical distance-based clustering algorithm. Here, the canopy clustering algorithm is implemented as an efficient clustering technique by means of knowledge integration. Our study of canopy clustering indicates that it fits the K-Means paradigm of computing well as a basis for implementing a clustering algorithm. This paper shows some advantages of canopy clustering over the plain K-Means clustering mechanism and proposes canopy clustering as a pre-clustering step for the K-Means method. We use Hadoop's MapReduce programming model to run K-Means clustering with canopy clustering. The experimental results show that the Canopy + K-means algorithm runs faster than the K-means algorithm alone. Both show a good speed-up ratio in the Hadoop environment, and Canopy + K-means achieves the better speed-up ratio of the two.

Keywords: Abnormal Event Detection, Outlier Detection, Video Data Stream, Sparse Learning, Dynamic Detection.
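
The canopy idea summarized in the abstract can be illustrated with a minimal single-machine sketch. It is not the Hadoop MapReduce implementation evaluated in the paper. Several choices here are assumptions made only for illustration: Manhattan distance as the cheap measure, Euclidean distance as the exact measure, the loose and tight thresholds `t1` and `t2`, the use of canopy centers as initial K-Means centers, and all function names.

```python
import math


def cheap_distance(a, b):
    # Assumed inexpensive measure: Manhattan distance.
    return sum(abs(x - y) for x, y in zip(a, b))


def exact_distance(a, b):
    # Assumed exact measure: Euclidean distance.
    return math.dist(a, b)


def build_canopies(points, t1, t2):
    """Partition points into overlapping canopies (loose t1 >= tight t2 >= 0)."""
    if not (t1 >= t2 >= 0):
        raise ValueError("thresholds must satisfy t1 >= t2 >= 0")
    remaining = list(range(len(points)))  # indices still eligible as centers
    canopies = []
    while remaining:
        center = remaining[0]
        # Every point within the loose threshold joins this canopy,
        # so a point may belong to several canopies.
        members = [i for i in range(len(points))
                   if cheap_distance(points[center], points[i]) <= t1]
        canopies.append((center, members))
        # Points within the tight threshold can no longer start a canopy.
        remaining = [i for i in remaining
                     if cheap_distance(points[center], points[i]) > t2]
    return canopies


def canopy_kmeans(points, t1, t2, iterations=10):
    """K-Means seeded by canopy centers; exact distances are computed
    only between a point and the centers of canopies that contain it."""
    canopies = build_canopies(points, t1, t2)
    centers = [points[c] for c, _ in canopies]
    # candidates[i] lists the canopies (cluster indices) containing point i.
    candidates = [[] for _ in points]
    for k, (_, members) in enumerate(canopies):
        for i in members:
            candidates[i].append(k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step, restricted to shared canopies.
        for i, p in enumerate(points):
            labels[i] = min(candidates[i],
                            key=lambda k: exact_distance(p, centers[k]))
        # Update step: move each center to the mean of its assigned points.
        for k in range(len(centers)):
            assigned = [points[i] for i in range(len(points)) if labels[i] == k]
            if assigned:
                centers[k] = tuple(sum(col) / len(assigned)
                                   for col in zip(*assigned))
    return labels, centers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels, centers = canopy_kmeans(data, t1=4, t2=2)
    print(labels, centers)
```

In this sketch the number of clusters is determined by the number of canopies rather than fixed in advance, and canopy membership is computed once and kept fixed while the centers move. Both are simplifications; the thresholds would need tuning for real data.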